1.
Bioinformatics ; 39(3), 2023 Mar 01.
Article in English | MEDLINE | ID: mdl-36864624

ABSTRACT

MOTIVATION: A high-quality sequence assembly is the most complete representation of an individual's genetic information. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies totalling hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. RESULTS: Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool compresses the genomes significantly better than existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or part of one) in a fraction of a second and to easily append new samples to a compressed collection. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly falling cost and improving accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundational tool for storing, distributing and accessing pangenome data. AVAILABILITY AND IMPLEMENTATION: The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome; Software; Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing
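The core idea behind compressing a collection of highly similar assemblies is to store shared sequence once and encode each sample as edits against it. The sketch below illustrates that reference-delta idea with Python's standard `difflib` and `zlib`; it is a toy, not AGC's actual segment-based algorithm:

```python
import difflib
import zlib

def delta_compress(reference: str, sample: str) -> bytes:
    """Encode a sample as copy/insert operations against a shared
    reference, then deflate the small edit script."""
    ops = []
    sm = difflib.SequenceMatcher(a=reference, b=sample, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(f"C{i1},{i2}")        # copy reference[i1:i2]
        else:
            ops.append("I" + sample[j1:j2])  # insert literal bases
    return zlib.compress(";".join(ops).encode())

reference = "ACGT" * 1000                                # 4 kb toy contig
sample = reference[:1500] + "TTTT" + reference[1500:]    # one small insertion
blob = delta_compress(reference, sample)
print(len(sample), len(blob))  # the encoded delta is far smaller than the raw sample
```

A decoder would replay the copy/insert script against the same reference; random access to a contig then only requires the reference plus that contig's delta.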
2.
Bioinformatics ; 37(19): 3358-3360, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33787870

ABSTRACT

SUMMARY: Variant Call Format (VCF) files storing the results of sequencing projects occupy a lot of space. We propose VCFShark, a compressor able to shrink VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). Its advantage over competitors is greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main-memory requirements below 30 GB allow the tool to be used on typical workstations even for large datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/refresh-bio/vcfshark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
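Genotype-heavy VCF compresses so well because the genotype fields come from a tiny, highly skewed alphabet, so dedicated per-field coding beats generic text compression. A hypothetical sketch of one such trick, 2-bit packing of biallelic diploid genotypes (VCFShark's real per-field codecs are considerably more elaborate):

```python
# Simulated biallelic genotype column for 10 000 samples, mostly "0|0".
GT = ["0|0"] * 9000 + ["0|1"] * 700 + ["1|1"] * 300

codes = {"0|0": 0, "0|1": 1, "1|0": 2, "1|1": 3}
inv = {v: k for k, v in codes.items()}

# Pack four genotypes into each byte (2 bits per genotype).
packed = bytearray()
for i in range(0, len(GT), 4):
    byte = 0
    for j, gt in enumerate(GT[i:i + 4]):
        byte |= codes[gt] << (2 * j)
    packed.append(byte)

# Round trip: unpack and compare against the original column.
decoded = [inv[(b >> (2 * j)) & 3] for b in packed for j in range(4)]

raw_text = "\t".join(GT)
print(len(raw_text), len(packed))  # ~16x smaller before any entropy coding
```

An entropy coder applied on top of the packed bytes would then exploit the skew toward the reference genotype.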

3.
Bioinformatics ; 35(22): 4791-4793, 2019 Nov 01.
Article in English | MEDLINE | ID: mdl-31225861

ABSTRACT

SUMMARY: Large sequencing projects now handle tens of thousands of individuals, and the huge files summarizing their findings require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, squeezing a single human genotype to less than 62 KB. Moreover, it can also compress single samples against an existing database, achieving comparable results. AVAILABILITY AND IMPLEMENTATION: https://github.com/refresh-bio/GTShark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression; Genomics; Genotype; Humans; Software
4.
Bioinformatics ; 35(1): 133-136, 2019 Jan 01.
Article in English | MEDLINE | ID: mdl-29986074

ABSTRACT

Summary: Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and a parallel implementation, our software estimates distances between 40 715 pathogens in <7 min (on a modern workstation), 26 times faster than Mash, its main competitor. Availability and implementation: https://github.com/refresh-bio/kmer-db and http://sun.aei.polsl.pl/REFRESH/kmer-db. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Biological Evolution; Computational Biology; Software; Genome
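The k-mer approach can be illustrated in a few lines: decompose each sequence into its set of k-mers and compare the sets, e.g. with the Jaccard index. This toy sketch omits everything that makes Kmer-db fast (sparse matrices, parallelism) and the conversion of similarities into evolutionary distances:

```python
def kmers(seq: str, k: int = 8) -> set:
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Fraction of shared k-mers: 1.0 = identical sets, 0.0 = disjoint."""
    return len(a & b) / len(a | b)

g1 = "ACGTACGGTTCA" * 20                 # toy "genome"
g2 = g1[:100] + "T" + g1[101:]           # single substitution
g3 = "TTGACCA" * 34                      # unrelated sequence

print(jaccard(kmers(g1), kmers(g2)))     # close relatives share most k-mers
print(jaccard(kmers(g1), kmers(g3)))     # distant sequences share few or none
```

A single substitution only disturbs the k windows that overlap it, which is why k-mer set similarity tracks sequence relatedness.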
5.
Bioinformatics ; 34(11): 1834-1840, 2018 Jun 01.
Article in English | MEDLINE | ID: mdl-29351600

ABSTRACT

Motivation: Genome sequencing is now routine in many research centers. Projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium determine huge databases of genotypes in large populations. As these collections grow, fast and memory-frugal ways of representing and searching them become crucial. Results: We present GTC (GenoType Compressor), a novel compressed data structure for representing huge collections of genetic variation data. It significantly outperforms existing solutions in both compression ratio and the time needed to answer various types of queries. We show that the largest publicly available database, of about 60 000 haplotypes at about 40 million SNPs, can be stored in <4 GB, while variant-related queries are answered in a fraction of a second. Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc. Contact: sebastian.deorowicz@polsl.pl. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression; Genomics/methods; Genotype; Polymorphism, Single Nucleotide; Sequence Analysis, DNA/methods; Software; Algorithms; Databases, Genetic; Genotyping Techniques/methods; Haplotypes; Humans
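The query model can be pictured as one bit vector per variant site, with one bit per sample: counting and co-occurrence queries then reduce to population counts and bitwise AND. A minimal illustration with made-up sites and 8 samples (GTC's actual structure compresses these vectors and supports many more query types):

```python
# One bit vector per variant site; bit i is set when sample i
# carries the alternate allele (toy data, 8 samples).
variants = {
    "chr1:12345": 0b10110010,
    "chr1:67890": 0b00010010,
}

def carriers(site: str) -> int:
    """How many samples carry the alternate allele at a site."""
    return bin(variants[site]).count("1")

def shared_carriers(a: str, b: str) -> int:
    """Samples carrying the alternate allele at both sites."""
    return bin(variants[a] & variants[b]).count("1")

print(carriers("chr1:12345"))                       # 4
print(shared_carriers("chr1:12345", "chr1:67890"))  # 2
```

Because most sites are rare variants, these bit vectors are extremely sparse, which is what makes aggressive compression compatible with fast queries.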
6.
PLoS Genet ; 11(10): e1005579, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26474060

ABSTRACT

Gene retroposition leads to considerable genetic variation between individuals. Recent studies revealed the presence of at least 208 retroduplication variations (RDVs), a class of polymorphisms in which a retrocopy is present in or absent from an individual genome. Most of these RDVs resulted from recent retroduplications. In this study, we used the results of Phase 1 of the 1000 Genomes Project to investigate variation in the loss of ancestral (i.e. shared with other primates) retrocopies among different human populations. In addition, we examined retrocopy expression levels using RNA-Seq data derived from the Illumina BodyMap project, as well as data from lymphoblastoid cell lines provided by the Geuvadis Consortium. We also developed a new approach to detect novel retrocopies absent from the reference human genome. We experimentally confirmed the existence of the detected retrocopies and determined their presence or absence in human genomes from 17 different populations. Altogether, we were able to detect 193 RDVs; the majority resulted from retrocopy deletion. Most of these RDVs had not been previously reported. We experimentally confirmed the expression of 11 ancestral retrogenes that underwent deletion in certain individuals. The frequency of their deletion, with the exception of one retrogene, is very low. The expression, conservation and low rate of deletion of the remaining 10 retrocopies may suggest some functionality. Aside from the presence or absence of expressed retrocopies, we also searched for differences in retrocopy expression levels between populations, finding 9 retrogenes that undergo statistically significant differential expression.


Subjects
Evolution, Molecular; Gene Duplication; Genome, Human; Polymorphism, Genetic; Animals; Gene Expression Regulation; High-Throughput Nucleotide Sequencing; Human Genome Project; Humans; Primates/genetics
7.
Sci Rep ; 5: 11565, 2015 Jun 25.
Article in English | MEDLINE | ID: mdl-26108279

ABSTRACT

The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the necessity of storing and transferring huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that is able to compress a collection of 1092 human diploid genomes about 9500 times. This result is about 4 times better than what is offered by other existing compressors. Moreover, our algorithm is very fast, processing the data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared to about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.


Subjects
Algorithms; Chromosome Mapping/methods; Data Compression/methods; Genomics/methods; Arabidopsis/genetics; Genome, Human/genetics; Genome, Plant/genetics; Humans; Reproducibility of Results; Sequence Analysis, DNA/methods
9.
BMC Plant Biol ; 14: 329, 2014 Dec 05.
Article in English | MEDLINE | ID: mdl-25476999

ABSTRACT

BACKGROUND: For most organisms, even when the genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics' and literature data. However, researchers have little guidance on how well these pipelines perform. Here, we used the recently sequenced potato genome as a case study. Potato was selected because its genome is newly sequenced and it is a non-model plant, yet relatively ample information on individual potato genes exists and multiple gene expression profiles are available. RESULTS: We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases the number of highly co-expressed GO processes by over 50%, and agrees much more closely with the gold standard. CONCLUSIONS: We find that different annotation pipelines produce different results, and we show how to integrate them into a unified annotation of higher quality than any single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.


Subjects
Genome, Plant; Molecular Sequence Annotation; Solanum tuberosum/genetics
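The disagreement between pipelines, and one simple way to reconcile it, can be sketched with made-up GO term sets for a single gene. Note that the consensus rule below is plain majority voting, not the GO structure-based algorithm the paper introduces, and the term sets are purely illustrative:

```python
# Hypothetical GO annotations for one gene from three pipelines.
pred = {
    "pipelineA": {"GO:0006952", "GO:0009607", "GO:0005515"},
    "pipelineB": {"GO:0006952", "GO:0016301"},
    "pipelineC": {"GO:0009607", "GO:0006952"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Pairwise agreement between pipelines is low ...
for x, y in [("pipelineA", "pipelineB"),
             ("pipelineA", "pipelineC"),
             ("pipelineB", "pipelineC")]:
    print(x, y, round(jaccard(pred[x], pred[y]), 2))

# ... so keep only terms predicted by a majority of pipelines.
consensus = {t for t in set.union(*pred.values())
             if sum(t in p for p in pred.values()) >= 2}
print(sorted(consensus))  # ['GO:0006952', 'GO:0009607']
```

A structure-aware reconciliation would additionally credit predictions of ancestors or descendants of a term in the GO graph rather than requiring exact term matches.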
10.
PLoS One ; 9(10): e109384, 2014.
Article in English | MEDLINE | ID: mdl-25289699

ABSTRACT

The availability of thousands of individual genomes of one species should boost rapid progress in, for example, personalized medicine and the understanding of genotype-phenotype interactions. A key operation in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in both exact and approximate matching models, against a collection of thousands of genomes. Its unique feature is its small, customizable index size: the index fits in a standard computer with 16-32 GB, or even 8 GB, of RAM for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries (of average length 150 bp) are handled in an average time of 39 µs, and queries with up to 3 mismatches in 373 µs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.


Subjects
Abstracting and Indexing; Computers; Genome; Genomics; Algorithms; Computational Biology/methods; Datasets as Topic; Genomics/methods
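The query model, reporting all occurrences of a pattern across a genome collection, can be mimicked with a toy k-mer index: seed each query with its first k-mer, then verify full matches at the candidate positions. This is only a conceptual stand-in for MuGI's compact index (it handles exact matching only, and the seed length is a toy value):

```python
from collections import defaultdict

K = 4  # seed length (toy value; a real index uses much longer seeds)

def build_index(genomes: dict) -> dict:
    """Map each k-mer to every (genome, offset) at which it occurs."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].append((name, i))
    return index

def find(index: dict, genomes: dict, pattern: str) -> list:
    """Seed with the first k-mer of the pattern, then verify the rest."""
    hits = []
    for name, i in index.get(pattern[:K], []):
        if genomes[name][i:i + len(pattern)] == pattern:
            hits.append((name, i))
    return hits

genomes = {"g1": "ACGTACGTTTGACA", "g2": "TTGACAACGTACGT"}
idx = build_index(genomes)
print(find(idx, genomes, "TTGACA"))  # [('g1', 8), ('g2', 0)]
```

The engineering challenge MuGI addresses is making such an index small: exploiting the similarity between genomes so that thousands of them fit in ordinary workstation RAM.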
11.
Bioinformatics ; 29(20): 2572-8, 2013 Oct 15.
Article in English | MEDLINE | ID: mdl-23969136

ABSTRACT

MOTIVATION: Genomic repositories are growing rapidly, as witnessed by the 1000 Genomes and UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in recent years. The well-known large redundancy in human sequences is not easy to exploit because of the huge memory requirements of traditional compression algorithms. RESULTS: We show how to obtain a compression ratio several times higher than the best reported results on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ∼400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, as is typical of existing solutions. AVAILABILITY: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Arabidopsis/genetics; Data Compression/methods; Genome, Human; Genome, Plant; High-Throughput Nucleotide Sequencing/methods; Algorithms; Base Sequence; Databases, Genetic; Genomics/methods; Humans; Sequence Analysis, DNA/methods